---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# genderBR

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/genderBR)](https://cran.r-project.org/package=genderBR)
[![Travis-CI Build Status](https://travis-ci.org/meirelesff/genderBR.svg?branch=master)](https://travis-ci.org/meirelesff/genderBR)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/meirelesff/genderBR?branch=master&svg=true)](https://ci.appveyor.com/project/meirelesff/genderBR)
[![Package-License](https://img.shields.io/badge/License-GPL-brightgreen.svg)](http://www.gnu.org/licenses/gpl-2.0.html)

`genderBR` predicts gender from Brazilian first names using data from the Instituto Brasileiro de Geografia e Estatística's (IBGE) 2010 Census.

## How does it work?

`genderBR`'s main function is `get_gender`, which takes a string with a Brazilian first name and predicts its gender using data from the IBGE's 2010 Census, drawn from its API and from an internal dataset.

In more detail, it uses the number of females and males with the same name in Brazil, or in a given Brazilian state, to calculate the proportion of female uses of that name. The function classifies a name as female or male only when that proportion falls beyond a given threshold (e.g., `female if proportion > 0.9`, or `male if proportion <= 0.1`); proportions between those thresholds are classified as missing (`NA`). An example:

```{r}
library(genderBR)

get_gender("joão")
get_gender("ana")
```
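
The classification rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's actual implementation: `classify_by_proportion` is a hypothetical helper, the counts are made up, and the cut-offs are simply the example thresholds mentioned above.

```{r}
# Hypothetical helper, for illustration only (not part of genderBR).
# upper and lower are assumed cut-offs on the proportion of female uses.
classify_by_proportion <- function(n_female, n_male, upper = 0.9, lower = 0.1) {
  prop_female <- n_female / (n_female + n_male)
  ifelse(prop_female > upper, "Female",
         ifelse(prop_female <= lower, "Male", NA_character_))
}

# Made-up counts: a mostly female name, a mostly male name, and an ambiguous one
classify_by_proportion(n_female = c(950, 20, 500), n_male = c(50, 980, 500))
```

Returning `NA` for the middle range trades coverage for accuracy: ambiguous names stay unclassified rather than being forced into a category.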

Multiple names can be passed in the same function call:

```{r}
get_gender(c("pedro", "maria"))
```

Full names, as well as names written in lower or upper case, are also accepted as inputs:

```{r}
get_gender("Mario da Silva")
get_gender("ANA MARIA")
```

Additionally, one can filter results by state with the argument `state`, or get the probability that a given first name belongs to a female person by setting the `prob` argument to `TRUE` (defaults to `FALSE`).

```{r}
# What is the probability that the name Ariel belongs to a female person in Brazil?
get_gender("Ariel", prob = TRUE)

# What about differences between Brazilian states?
get_gender("Ariel", prob = TRUE, state = "RJ") # RJ, Rio de Janeiro
get_gender("Ariel", prob = TRUE, state = "RS") # RS, Rio Grande do Sul
get_gender("Ariel", prob = TRUE, state = "SP") # SP, São Paulo
```

Note that a vector of state abbreviations is a valid input for the `get_gender` function, so this also works:

```{r}
name <- rep("Ariel", 3)
states <- c("rj", "rs", "sp")
get_gender(name, prob = TRUE, state = states)
```

This is also useful for predicting the gender of different individuals living in different states:

```{r}
df <- data.frame(name = c("Alberto da Silva", "Maria dos Santos", "Thiago Rocha", "Paula Camargo"),
                 uf = c("AC", "SP", "PE", "RS"),
                 stringsAsFactors = FALSE
                 )

df$gender <- get_gender(df$name, state = df$uf)

df
```
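
Because names that fall between the thresholds come back as `NA`, it can help to check how many rows were left unclassified before using the predictions. The sketch below uses a hypothetical `gender` vector that stands in for the output of `get_gender`; the `"Male"`/`"Female"` labels are an assumption about how results are encoded.

```{r}
# Hypothetical predictions, standing in for the output of get_gender()
gender <- c("Male", "Female", NA, "Female", NA)

# Count each class, keeping NA as its own category
table(gender, useNA = "ifany")

# Share of names that could not be classified
mean(is.na(gender))
```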


### Brazilian state abbreviations

The `genderBR` package relies on Brazilian state abbreviations (acronyms) to filter results. To get a complete dataset with the full name, IBGE code, and abbreviation of all 27 Brazilian states, use the `get_states` function:

```{r}
get_states()
```
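
When state abbreviations come from user input or messy data, it may be worth validating them before filtering. The following is a minimal sketch in base R: `ufs` is a hand-written vector of the 27 abbreviations and `is_valid_uf` is a hypothetical helper, neither of which is part of the package.

```{r}
# The 27 federative units (26 states plus the Federal District)
ufs <- c("AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO",
         "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI",
         "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO")

# Hypothetical helper: TRUE when x is a known abbreviation,
# ignoring case and surrounding whitespace
is_valid_uf <- function(x) toupper(trimws(x)) %in% ufs

is_valid_uf(c("rj", "SP", " rs ", "XX"))
```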

## Geographic distribution of Brazilian first names

The `genderBR` package can also be used to get the relative and absolute number of people with a given name by gender and by state in Brazil. To that end, use the `map_gender` function:

```{r}
map_gender("maria")
```

To restrict the query to a single gender, use the optional argument `gender` (valid inputs are `f`, for female; `m`, for male; or `NULL`, the default option).

```{r}
map_gender("iris", gender = "m")
```

## Installing

To install `genderBR`'s latest stable version from CRAN, use:

```{r, eval = FALSE}
install.packages("genderBR")
```


To install the development version from GitHub, use:

```{r, eval = FALSE}
# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("meirelesff/genderBR")
```
    
## Data

The surveyed population in the Instituto Brasileiro de Geografia e Estatística's (IBGE) 2010 Census includes 190.8 million Brazilians, with more than 130,000 unique first names.

To extract the number of male or female uses of a given first name in Brazil, the package uses the IBGE's [API](http://censo2010.ibge.gov.br/nomes/) and, from version 1.1.0 onward, an internal dataset containing all the names recorded in the IBGE's Census. In this service, different spellings (e.g., Ana and Anna, or Marcos and Markos) count as different names, and only names with more than 20 occurrences, or more than 15 occurrences in a given state, are included in the database.

For more information on the IBGE's data, please check (in Portuguese): 
http://censo2010.ibge.gov.br/nomes/

## Author

[Fernando Meireles](http://fmeireles.com)
    
